Abstract

We study the adaptation of Link Grammar Parser to the biomedical
sublanguage with a focus on domain terms not found in a general parser
lexicon. Using two biomedical corpora, we implement and evaluate three
approaches to addressing unknown words: automatic lexicon expansion, the
use of morphological clues, and disambiguation using a part-of-speech
tagger. We evaluate each approach separately for its effect on parsing
performance and consider combinations of these approaches. In addition to
a 45% increase in parsing efficiency, we find that the best approach,
incorporating information from a domain part-of-speech tagger, offers a
statistically significant 10% relative decrease in error. The adapted
parser is available under an open-source license at
http://www.it.utu.fi/biolg.

1 Introduction 

In applying general parsers to specific domains, adaptation is often
necessary to achieve high parsing performance (see e.g. (Sekine, 1997)).
Sublanguage is defined by Grishman (2001) as a specialized form of a
natural language that is used within a particular domain or subject
matter. It is characterized by specialized vocabulary, semantic
relationships, and in many cases syntax.

In this paper, we study lexical adaptation, that is, adaptation
addressing the specialized vocabulary. This is an important part of the
process of customizing a general parser to a sublanguage. Among other
issues, the unknown word rate increases dramatically when moving from
general language to increasingly technical domains such as that of
biomedicine (Lease and Charniak, 2005). This can lead to increased
ambiguity, reduced parsing performance, and errors in establishing the
correct relationships between words for semantic mining (Pyysalo et al.,
2006).

Until recently, Information Extraction (IE) systems for mining semantic
relationships from texts of technical sublanguages avoided full syntactic
parsing. The quality of parsing has a well-established effect on the
performance of IE systems, and the accuracy of general parsers in
technical domains is comparatively low. Additionally, many
domain-specific parsers lack portability to a new domain. Finally, the
time required for full parsing is also a problem for IE systems. But the
biomedical IE community now faces limitations in pattern-matching
(Blaschke et al., 1999) and shallow parsing (Pustejovsky et al., 2002)
methods that are inefficient in the processing of long distance
dependencies and complex sentences. Recent advances in parsing techniques
have further created an increased interest in the adaptation of full
parsers.

Here, we consider the lexical adaptation of a full parser, the Link
Grammar Parser^1 (LGP) of Sleator and Temperley (1991). The choice of
parser addresses the recent interest in LGP in the biomedical IE
community (Ding et al., 2003; Szolovits, 2003; Ahmed et al., 2005;
Alphonse et al., 2004). Our evaluation is performed using two corpora of
sentences from Medline abstracts with a focus on protein-protein
interactions, the identification of which is the key aim of most
biomedical IE systems.

^1 http://www.link.cs.cmu.edu/link/

Recently, two approaches addressing unknown words in applying LGP to the
biomedical domain have been proposed. Szolovits (2003) introduced a
method for heuristically mapping terminology between lexicons and applied
this mapping to augment the LGP dictionary with terms from the UMLS
Specialist Lexicon^2. Based on an analysis of a domain corpus, two of the
authors have proposed an extension of the morpho-guessing system of LGP
for disambiguating domain terms based on their suffixes (Aubin et al.,
2005). The effect of the proposed extensions on parsing performance
against an annotated reference corpus was not evaluated in these two
studies.

Here we analyze the effect of these lexical ex- 
tensions using an annotated biomedical corpus. 
We further propose, implement and evaluate in de- 
tail a third approach to resolving unknown words 
in LGP using information from a part-of-speech 
(POS) tagger. 

2 Link Grammar Parsing 

The link grammar formalism is closely related to 
dependency formalism. It is based on the notion of 
typed links connecting words. The result of pars- 
ing is one or more ordered parses, termed linkages. 
A linkage consists of a set of links connecting the 
words of a sentence so that links do not cross, no 
two links connect the same two words, and the 
types of the links satisfy the linking requirements 
given to each word in the lexicon. An example 
linkage is given below. 




this ORF was identified as the padA gene . 
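The structural constraints on a linkage can be checked mechanically. The following is a minimal sketch, assuming a linkage is represented as a list of (i, j) word-position pairs; this representation and the function name are illustrative and are not taken from LGP. The check covers only the no-crossing and no-duplicate conditions, not the satisfaction of linking requirements from the lexicon.

```python
from itertools import combinations


def is_structurally_valid(links):
    """Check that links do not cross and no two links join the same words.

    links: list of (i, j) word-position pairs (hypothetical representation).
    """
    normalized = [tuple(sorted(link)) for link in links]
    # No two links may connect the same two words.
    if len(set(normalized)) != len(normalized):
        return False
    # Links (a, b) and (c, d) cross when exactly one endpoint of one link
    # lies strictly between the endpoints of the other.
    for (a, b), (c, d) in combinations(normalized, 2):
        if a < c < b < d or c < a < d < b:
            return False
    return True
```

Nested links and links sharing a word are accepted; only interleaved links are rejected as crossing.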

Since the link grammar is rule-based and the 
parser makes no use of statistical methods, LGP 
is a good candidate for adaptation to new domains 
where annotated corpus data is rarely available. 
The lexical adaptation approaches we evaluate fur- 
ther require only a light linguistic analysis of do- 
main language. 

^2 http://specialist.nlm.nih.gov/

LGP has three different methods applied in a cascade to handle
vocabulary: dictionary lookup, morpho-guessing and unknown word guessing.
The LGP dictionary enumerates all words, including inflected forms, and
grammar rules are encoded through the linking requirements associated
with the words. Some unknown words are assigned linking requirements
based on their morphological features, such as the suffix -ly for
adverbs. This system is termed morpho-guessing (MG). Finally, words that
are neither found in the parser dictionary nor recognized by its
morpho-guessing rules are assigned all possible combinations of the
generic verb, noun and adjective linking requirements. This general
approach is, in principle, always capable of generating the correct
combination of linking requirements for unknown words. However, with an
increasing number of unknown words in a sentence, the approach leads to a
combinatorial explosion in the number of possible linkages and a rapid
increase in parsing time and decrease in parsing performance. The parser
is also time-limited: when a sentence cannot be parsed within a
user-specified time limit, LGP attempts parses using more efficient, but
restricted settings, leading to reduced parse quality.
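The cascade and the resulting growth in ambiguity can be sketched as follows. This is a simplified illustration, not the LGP implementation: the dictionary and suffix rules are toy data, all function names are hypothetical, and each unknown word is treated as having three alternative generic categories, which understates the real number of combinations of linking requirements.

```python
def lookup_categories(word, dictionary, suffix_rules):
    """Return (method, categories) for a word using a three-step cascade."""
    # 1. Dictionary lookup: exact match on the word form.
    if word in dictionary:
        return "dictionary", dictionary[word]
    # 2. Morpho-guessing: the first matching suffix rule decides.
    for suffix, category in suffix_rules:
        if word.endswith(suffix):
            return "morpho-guessing", [category]
    # 3. Unknown word: every generic open-class category stays possible.
    return "unknown", ["verb", "noun", "adjective"]


def count_category_combinations(sentence, dictionary, suffix_rules):
    """Multiply the number of candidate categories over all words."""
    total = 1
    for word in sentence:
        _, categories = lookup_categories(word, dictionary, suffix_rules)
        total *= len(categories)
    return total
```

Under this simplification, a sentence with k unknown words already admits 3^k category assignments before any linkage is built, which is the combinatorial growth described above.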

When parsing sublanguages that contain many 
words that are not in the lexicon, it is therefore 
beneficial to attempt to resolve unknown words to 
reduce ambiguity in parsing. 

3 Lexical adaptations 

We evaluate three approaches to lexical adaptation: lexicon extension,
morphological clues, and POS tagging. The approaches primarily involve
open-class words and use linking requirements from the original LGP.
Closed-class words, such as prepositions, are considered
domain-independent and expected to appear in the original lexicon, and
modification of the existing linking requirements (grammar adaptation) is
outside the scope of this study.

3.1 Extension of the lexicon 

The extension of the lexicon with external domain- 
specific knowledge is the most frequent approach 
to adaptation, provided that the resources are 
available for the domain. This can be done either 
manually or with automatic mapping methods. 

Suffix   POS                Examples
-ase     noun               synthetase, kinase
-in      noun               actin, kanamycin
-ity     noun               chronicity, hypochromicity
-ion     noun               septation, reguion
-on      noun               replicon, intron
-ol      noun               glycosylphosphatidylinositol
-ose     noun               isomaltotetraose, isomaltotriose
-or      noun               cofactor, repressor/activator
-yl      noun               hydroxyethyl, hydroxymethyl
-ine     noun               5-(hydroxymethyl)-2'-deoxyuridine
-ide     noun               iodide, oligodeoxynucleotide
-i       noun               casei, lactococci, termini
-ic      adjective          glycolytic, ribonucleic, uronic
-al      adjective          ribosomal, ribsosomal
-ive     adjective          nonpermissive, thermosensitive
-ar      adjective          intermolecular, intramolecular
-ble     adjective          inducible, metastable
-ous     adjective          exogenous, heterologous
-ae      latin adj.         influenzae, tarentolae
-us      latin adj.         pentosaceus, luteus, carnosus
-um      latin adj.         japonicum, tabacum, xylinum
-is      latin adj.         brevis, israelensis
-fold    adjective/adverb   10-fold, 4.5-fold, five-fold

Table 1: Biomedical suffixes involved in the extension of the morpho-guessing rules

Here, we evaluate the heuristic lexicon mapping proposed by Szolovits
(2003). This mapping can be used to automatically add domain-specific
terminology from an external specialized lexicon to the lexicon of a
parser. Words are mapped from a source lexicon (e.g. the domain lexicon)
to a target lexicon (e.g. the parser lexicon) based on their lexical
descriptions. As these descriptions typically differ between lexicons,
they cannot be transferred directly from one lexicon to another. Instead,
the mapping operates with sets of words that have the exact same lexical
description in their respective lexicons.

To assign a lexical description to a word w not 
in the target lexicon, the mapping finds words that 
have the exact same lexical description as w in the 
source lexicon, and that further have a descrip- 
tion in the target lexicon. Overlap in sets having 
the same descriptions is then used to select one of 
these target lexicon descriptions to assign to w. 
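The mapping step can be illustrated with a small sketch. It assumes that each lexicon is a dictionary from word to a hashable description, and that "overlap" is resolved by taking the target description shared by the largest number of words with the same source description. The text above does not specify the selection rule in this detail, so that choice, like the function name, is an assumption made for illustration only.

```python
from collections import Counter


def map_description(word, source_lexicon, target_lexicon):
    """Pick a target-lexicon description for a word missing from the target."""
    source_description = source_lexicon[word]
    # Words with the exact same source description that the target also covers.
    peers = [
        other
        for other, description in source_lexicon.items()
        if description == source_description and other in target_lexicon
    ]
    if not peers:
        return None  # nothing to base a mapping on
    # Assumption: the target description with the largest overlap wins.
    counts = Counter(target_lexicon[other] for other in peers)
    return counts.most_common(1)[0][0]
```

A word whose source description is shared with no word known to the target lexicon receives no description and would remain subject to unknown word handling.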

Szolovits applied the introduced mapping to 
extend the lexicon of LGP with terms from the 
UMLS Specialist Lexicon and observed that the 
mapping heuristic chose poor definitions for some 
smaller sets, for which the definitions were man- 
ually modified. The created UMLS dictionary ex- 
tension contains 121,120 words that do not appear 
in the original LGP dictionary. 

Szolovits observed that many of the phrases included in the extension
"bear no specific lexical information in Specialist that is not obvious
from their component words". Additionally, phrases are parsed using the
LGP idiom system, which does not assign internal structure to the
phrases, complicating comparison against a reference corpus. For these
reasons, we evaluate the no-phrases version of the extension^3. The
effect of this extension has also been considered by Pyysalo et al.
(2006).

3.2 Morphological clues 

Morphological clues can be exploited by LGP to predict the
morpho-syntactic classes (hence syntactic behaviour) of unknown words.
Specific domains are an interesting application for this type of
adaptation because a great part of technical lexicons presents regular
morphological features, which, according to Mikheev (1997), obey
morphological regularities of the general language. We observe that this
assumption holds only partially because of the presence of foreign words
in specialized texts and argue that a minimal morphological study of the
corpus is necessary. Such studies have been performed on the biomedical
domain by Spyns (1994) and Aubin et al. (2005).

^3 http://www.cdm.csail.mit.edu/projects/text/

While many POS taggers employ morphological features to tag unknown
words, domain extension of a rule-based approach such as the LGP
morpho-guessing system can be preferable in lexical adaptation to domains
where resources such as tagged corpora are not available for training
taggers. Further, the MG extension allows assigning specific rules at a
greater granularity than POS tags.

We have implemented and evaluated the extension of the LGP
morpho-guessing rules proposed by Aubin et al. (2005). This extension of
23 new suffixes for the biomedical domain is presented in Table 1. Aubin
et al. (2005) further identified in the corpus a small number of
exceptions to these rules ("wherein", "kcal/mol", "ultrafine", etc.),
which were manually added to the dictionary.
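The suffix rules of Table 1 can be expressed as a small lookup. The sketch below is not the LGP morpho-guessing code. The rule data are transcribed from Table 1 and the exception list contains only the three words named above; trying longer suffixes first (so that -ous takes precedence over -us) is an assumption, as the rule ordering is not described here.

```python
# Suffix rules transcribed from Table 1.
SUFFIX_POS = {
    "ase": "noun", "in": "noun", "ity": "noun", "ion": "noun", "on": "noun",
    "ol": "noun", "ose": "noun", "or": "noun", "yl": "noun", "ine": "noun",
    "ide": "noun", "i": "noun",
    "ic": "adjective", "al": "adjective", "ive": "adjective",
    "ar": "adjective", "ble": "adjective", "ous": "adjective",
    "ae": "latin adj.", "us": "latin adj.", "um": "latin adj.",
    "is": "latin adj.",
    "fold": "adjective/adverb",
}

# Exceptions named in the text; in the paper these are handled by adding
# them to the dictionary, which is consulted before morpho-guessing.
EXCEPTIONS = {"wherein", "kcal/mol", "ultrafine"}

# Assumption: longer suffixes are tried before shorter ones.
_SUFFIXES_BY_LENGTH = sorted(SUFFIX_POS, key=len, reverse=True)


def guess_pos(word):
    """Return the POS class suggested by the word's suffix, or None."""
    if word in EXCEPTIONS:
        return None
    for suffix in _SUFFIXES_BY_LENGTH:
        if word.endswith(suffix):
            return SUFFIX_POS[suffix]
    return None
```

Words such as "plasmid" or "sigma" match none of the 23 suffixes and would still fall through to unknown word handling.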

3.3 POS tagging 

We finally propose to provide the parser with an input sentence enriched
with POS tags. In order to retain the decision-making power of the parser
and to avoid inconsistencies between tagged words and their entry in the
parser lexicon (see Grover et al. (2005)), we restrict the use of POS
tags to unknown words only.

We modified LGP so that POS information can be passed to the parser by
appending POS tags to input words (e.g. actin/NN). We further modified
the parser so that when an unknown word is given a POS tag, the parser
assigns linking requirements to the word based on a given mapping from
POS tags to LGP dictionary entries. We defined such a mapping, presented
in Table 2, for Penn tagset POS categories corresponding to content
words. FW (foreign words) and SYM (symbols) tags were not mapped due to
their syntactic heterogeneity. Existing LGP rules were used to define the
behavior of POS-mapped words, and the most generic applicable rule was
chosen in each case. For instance, words tagged "NN" map to the rule for
nouns that can be either mass or countable, so that there is no
constraint on determiners.

Tag    Description              LGP rule
NN     common noun, sing.       words.n.4
NNS    common noun, pl.         words.n.2.s
NNP    proper noun, sing.       CAPITALIZED-WORDS
NNPS   proper noun, pl.         PL-CAPITALIZED-WORDS
JJ     adjective, base          UNKNOWN-WORD.a
JJR    adjective, comparative   words.adj.2
JJS    adjective, superlative   words.adj.3
VB     verb, base               words.v.6.1
VBD    verb, past tense         words.v.6.3
VBZ    verb, present 3d pers.   S-WORDS.v
VBP    verb, present non-3d     words.v.6.1
VBG    verb, gerund             ING-WORDS
VBN    verb, past participle    ED-WORDS
CD     number                   NUMBERS
RB     adverb, base             words.adv.1

Table 2: POS tags mapping to LGP rules
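The handling of tagged input can be sketched as follows. The mapping is transcribed from Table 2; everything else is an illustrative assumption rather than the modified LGP code, in particular the function names, the rule that a trailing "/TAG" counts as a tag only when TAG consists of uppercase letters, and the strings returned for dictionary words and unresolved words.

```python
# Mapping transcribed from Table 2.
POS_TO_LGP_RULE = {
    "NN": "words.n.4", "NNS": "words.n.2.s",
    "NNP": "CAPITALIZED-WORDS", "NNPS": "PL-CAPITALIZED-WORDS",
    "JJ": "UNKNOWN-WORD.a", "JJR": "words.adj.2", "JJS": "words.adj.3",
    "VB": "words.v.6.1", "VBD": "words.v.6.3", "VBZ": "S-WORDS.v",
    "VBP": "words.v.6.1", "VBG": "ING-WORDS", "VBN": "ED-WORDS",
    "CD": "NUMBERS", "RB": "words.adv.1",
}


def split_tagged(token):
    """Split 'word/TAG' at the last slash; return (word, None) if untagged."""
    word, slash, tag = token.rpartition("/")
    # Simplifying assumption: a tag is a run of uppercase letters, so that
    # tokens such as "kcal/mol" are not mistaken for tagged words.
    if slash and word and tag.isalpha() and tag.isupper():
        return word, tag
    return token, None


def resolve(token, dictionary):
    """Return (word, source of linking requirements) for one input token."""
    word, tag = split_tagged(token)
    if word in dictionary:
        return word, "dictionary"  # tags are ignored for known words
    if tag in POS_TO_LGP_RULE:
        return word, POS_TO_LGP_RULE[tag]
    return word, "unknown"  # unmapped tags such as FW and SYM end up here
```

Using the last slash as the separator keeps slash-containing words such as mutation/deletion intact when a tag is appended to them.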

To evaluate the effect of using both a general and a domain tagger, the
experiments were made using two taggers: the Brill tagger^4 trained on
the Wall Street Journal (general language) and the GENIA Tagger^5
(Tsuruoka et al., 2005) trained on the biomedical corpus GENIA. A
detailed evaluation and error analysis of GENIA Tagger is given in
(Tsuruoka et al., 2005), finding 98% accuracy on two biomedical corpora.
On this basis, we estimate the tagging accuracy at 81% for the Brill
tagger and 97% for GENIA Tagger. This estimate was performed by manually
checking tagging divergences between the two taggers on one of our
corpora.

4 Evaluation protocol 
4.1 Corpora 

Two corpora are used for the present evaluation: 
"interaction" and "transcript", both built in the 
context of IE in biomedical texts. Both corpora 
were tokenized and cleared of bibliographic refer- 
ences in a preprocessing step. 



Interaction contains 542 sentences (16,874 tokens) annotated for
dependencies using the Link Grammar annotation scheme. 600 sentences were
initially selected randomly from Pubmed^6 with the condition that they
contain at least two proteins for which a known interaction was entered
into the DIP database^7. 58 sentences consisting only of a nominal phrase
were then excluded as LGP does not, by design, parse them^8. Each
sentence was separately annotated by two annotators, and differences were
resolved by discussion. Links to punctuation were excluded, and link
types were not annotated. A total of 14,242 links were annotated in these
sentences.

The transcript corpus consists of 16,989 sentences (438,390 tokens)
retrieved with the query "Bacillus subtilis transcription" on Pubmed. It
was not annotated.

Both corpora are used to characterize the vocab- 
ulary coverage by the different methods applied in 
LGP. The annotated interaction corpus is also used 
as the reference corpus for the evaluation of pars- 
ing performance. 

4.2 Evaluation criteria 

We first evaluate vocabulary coverage in the original and extended
versions of LGP. We present the contribution of each method (dictionary,
morpho-guessing, POS-mapping and unknown words) implemented in LGP to
handle vocabulary. Results are given separately for types (i.e. distinct
forms) and tokens (i.e. occurrences) in the corpus.
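The distinction between type and token coverage can be made concrete with a short sketch. It assumes that the corpus is a list of word forms and that a caller-supplied function reports which method handles a word; both the function name `coverage` and this interface are hypothetical.

```python
from collections import Counter


def coverage(tokens, method_of):
    """Return (type_fractions, token_fractions) per vocabulary method.

    tokens: list of word occurrences in the corpus.
    method_of: callable mapping a word to the name of the handling method.
    """
    types = set(tokens)
    type_counts = Counter(method_of(word) for word in types)
    token_counts = Counter(method_of(word) for word in tokens)
    type_fractions = {m: c / len(types) for m, c in type_counts.items()}
    token_fractions = {m: c / len(tokens) for m, c in token_counts.items()}
    return type_fractions, token_fractions
```

A method that handles a few very frequent words scores high on tokens and low on types, while a method that handles many rare words shows the opposite pattern.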

We assess the ambiguity of the parsing process 
with two criteria: parsing time and linkage num- 
bers. Parsing time is immediately relevant to ap- 
plications of the parser to systems where large cor- 
pora must be parsed. Linkage numbers are a more 
direct measure of the ambiguity of parsing a sen- 
tence. For each sentence, the parser enumerates 
the total number of linkages allowed by the gram- 
mar. By taking the ratio of the number of linkages 
allowed by two versions of the parser, we can es- 
timate the relative increase or decrease in ambigu- 
ity. We report the per-sentence averages of both 
parsing time and linkage number ratios. 
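The per-sentence linkage number ratio can be computed as in the following sketch. The function name is hypothetical, and skipping sentences for which the baseline parser allows no linkage is an assumption, since the treatment of such sentences is not stated here.

```python
def mean_linkage_ratio(baseline_counts, modified_counts):
    """Average of per-sentence ratios modified / baseline linkage counts."""
    ratios = [
        modified / baseline
        for baseline, modified in zip(baseline_counts, modified_counts)
        if baseline > 0  # assumption: sentences without baseline linkages are skipped
    ]
    return sum(ratios) / len(ratios)
```

Averaging the per-sentence ratios gives each sentence equal weight. This differs from the ratio of total linkage counts, which would be dominated by the few sentences with very many linkages.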

[Figure 1: four stacked bar charts titled "Interaction - types", "Interaction - tokens", "Transcript - types" and "Transcript - tokens". Each chart shows, for Orig, UMLS, xMG and POS, the percentage (0% to 100%) of vocabulary handled by each method: Dictionary, MG, POS-map, Unknown.]

Figure 1: Vocabulary handling in the interaction and transcript corpora: the fraction of words and types
covered by each method in the original LGP and the three adaptations. Coverage for the POS adaptation
is shown only for GENIA Tagger as the coverage of the Brill tagger was essentially identical.

^4 http://research.microsoft.com/users/brill/
^5 http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/tagger/
^6 http://www.pubmed.com/
^7 http://dip.doe-mbi.ucla.edu/
^8 This limitation could be overcome by modification of the grammar, but
here we decided to avoid grammar adaptation and evaluate the parser with
respect to its intended coverage.

To determine the parsing performance of the extensions of LGP, we used
each of the extensions to parse the interaction corpus sentences and
compared the produced linkages against the reference corpus. For each
sentence, we determine the recall, i.e. the fraction of links in the
reference corpus that were present in parses returned by LGP^9. We report
average recall for both the first linkages as ordered by the LGP
heuristics and, to separate the effect of the heuristics from parser
performance, also the best linkages, that is, the linkages with the most
annotated links recovered. We further separately evaluate overall
performance and performance for the subset of sentences where no timeouts
occurred in parsing.
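The recall measure for first and best linkages can be sketched as follows, assuming links are untyped pairs of word positions (link types were not annotated in the reference corpus) and that the parser output is a list of linkages in the order produced by its ranking heuristics. The function names and data layout are illustrative.

```python
def link_recall(reference_links, linkage_links):
    """Fraction of reference links that also occur in the given linkage."""
    # Links are untyped and undirected, so each is stored as an unordered pair.
    reference = {frozenset(link) for link in reference_links}
    found = {frozenset(link) for link in linkage_links}
    return len(reference & found) / len(reference)


def first_and_best_recall(reference_links, linkages):
    """Recall of the first-ranked linkage and of the best linkage returned."""
    scores = [link_recall(reference_links, linkage) for linkage in linkages]
    return scores[0], max(scores)
```

The gap between the two values for a sentence isolates the effect of linkage ranking: the best-linkage score is what a perfect ranking heuristic would achieve with the same set of linkages.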

Experiments were performed on a 2.8GHz Intel 
Xeon with parameter values timeout=60 sec, 
limit = 1000, islands-ok=true. Default 
values were used for other parameters. The statis- 
tical significance of differences between the orig- 
inal parser and each of the modifications is as- 
sessed using the Wilcoxon signed-ranks test for 
overall first linkage performance, using the Bon- 
ferroni correction for multiple comparisons. 
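For reference, the signed-rank statistic and the Bonferroni adjustment can be sketched in a few lines. This is a textbook formulation written for illustration, not the procedure or software used in the experiments: zero differences are dropped, tied absolute differences receive average ranks, and the p-value computation is omitted. The number of comparisons passed to the correction is left to the caller.

```python
def signed_rank_statistic(first, second):
    """Return (W_plus, W_minus, n) for paired per-sentence scores."""
    diffs = [b - a for a, b in zip(first, second) if b != a]  # drop zeros
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied absolute differences.
        while (j + 1 < len(order)
               and abs(diffs[order[j + 1]]) == abs(diffs[order[i]])):
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus, len(diffs)


def bonferroni_alpha(alpha, comparisons):
    """Per-comparison significance level under the Bonferroni correction."""
    return alpha / comparisons
```

With four modifications each compared against the original parser, a family-wise level of 0.05 would correspond to a per-comparison level of 0.0125.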

5 Results 

In this section we present the evaluation results for the original LGP
(Orig), LGP with the UMLS dictionary extension (UMLS), LGP with the
morpho-guessing extension (xMG) and LGP with the POS extension, evaluated
with the two taggers, Brill and GENIA Tagger (GT).

5.1 Vocabulary coverage 

Figure 1 shows the proportion of vocabulary covered by each method on the
interaction and transcript corpora.

^9 Note that for connected, acyclic dependency graphs, precision equals
recall: for each missing link, there is exactly one extra link. While
there are some exceptions to connectedness and acyclicity in both LGP
linkages and the annotation, we believe recall can be used as a fair
estimate of overall performance.



The comparison of the results on types and tokens shows that the
dictionary has a good recognition rate on frequent types for both the
original and the UMLS versions. By contrast, the MG and POS-map methods
contribute to the recognition of a great number of types (particularly in
transcript) but few tokens. In addition, the discrepancy on types between
the two corpora for the dictionary method in all versions reflects the
increasing presence of low frequency non-canonical words with the growing
size of the corpus. Interestingly, we find that the reduction in unknown
words (black part in the charts) due to the UMLS and xMG extensions is
roughly similar, despite the former containing over 100,000 new words and
the latter only 23 new rules. The POS extension, as expected, reduces the
share of unknown words to almost zero.

The remaining unknown words differ in nature between the extensions.
Quite surprisingly, UMLS lacks a great number of species names (numerous
in transcript) and frequent gene or protein names (e.g. lacZ, 78
occurrences in transcript). In addition, the Specialist Lexicon version
used here contains no complex terms, which prevents the detection of
words like vitro and vivo used in the frequent terms in vitro and in
vivo. The evaluated xMG extension cannot handle gene/protein names
either, and also misses frequent technical terms that have no specific
morphological features, such as sigma, mutant and plasmid.

                     Orig   UMLS    Δ      xMG    Δ      Brill  Δ      GT     Δ
All, first linkage   74.2   [...]   4.7    76.0   7.0    75.4   4.7    76.8   10.1
All, best linkage    82.7   83.5    4.6    84.5   10.4   83.7   5.8    85.3   15.0
NT, first linkage    78.0   78.1    0.5    78.9   4.1    78.0   0.0    79.4   6.4
NT, best linkage     87.4   86.9    -4.0   88.0   4.8    86.7   -5.6   88.3   7.1
p                    N/A    p ≈ 0.06       p < 0.01      p ≈ 0.07      p < 0.01

Table 3: Performance. First linkage denotes the linkage ordered first by the parser heuristics and best
linkage the best performance achieved by any linkage returned by the parser. Results marked NT are for
the subset of sentences where no timeouts occurred for any of the modifications. Δ columns give relative
decrease in error with respect to the original LGP, and p values are for "All, first linkage" performance.

To assess lexicon coverage, we measured the contribution^10 and the
recognition^11 of the UMLS dictionary extension. We find that while the
contribution of the UMLS dictionary extension is very low, with 0.54% on
interaction and 2.3% on transcript, the recognition of the dictionary
method is augmented significantly by the UMLS extension (51% to 71% for
interaction and 25% to 40% for transcript). Nevertheless, as the size of
the dictionary does not significantly penalize the parsing time with LGP,
even a generic resource that contributes relatively little can be
beneficial.

^10 proportion of types of the resource found in the corpus
^11 proportion of types of the corpus found in the resource
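The two lexicon coverage measures defined above, contribution and recognition, reduce to set operations over types. A minimal sketch, with hypothetical function names and the resource and corpus each given as a collection of word types:

```python
def contribution(resource_types, corpus_types):
    """Proportion of types of the resource found in the corpus."""
    resource, corpus = set(resource_types), set(corpus_types)
    return len(resource & corpus) / len(resource)


def recognition(resource_types, corpus_types):
    """Proportion of types of the corpus found in the resource."""
    resource, corpus = set(resource_types), set(corpus_types)
    return len(resource & corpus) / len(corpus)
```

The two measures share a numerator and differ only in the denominator, which is why a very large resource can have a low contribution on a small corpus while still raising recognition substantially.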



5.2 Ambiguity 

The results of measuring the effect of the various extensions on
ambiguity are given in Table 4.

Metric       Orig    UMLS   xMG     Brill   GT
Time         15.4s   9.9s   10.8s   8.8s    8.6s
Lkg. ratio   1       0.67   0.68    0.70    0.66

Table 4: Ambiguity. Time is average parsing time per sentence, linkage
ratio is average of per-sentence linkage number ratios.



The reduction in the number of unknown words 
for the UMLS and xMG extensions is coupled 
with a roughly 30% reduction in both parsing time 
and linkage numbers. Although the POS exten- 
sion essentially eliminates unknown words, it only 
gives a decrease in parsing time and linkage num- 
bers that roughly mirrors the effect of the UMLS 
and xMG extensions. 

None of the extensions achieves more than 35% 
reduction in linkage numbers or more than 45% 
reduction in parsing time. This may reflect struc- 
tural ambiguity in the language and suggest a limit 
on how much ambiguity can be controlled through 
these lexical adaptation approaches. 
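The reductions quoted above follow directly from the values in Table 4. The short sketch below recomputes them; the dictionary and function names are illustrative, and the numbers are those reported in the table.

```python
# Values from Table 4: average parsing time (s) and average linkage ratio.
PARSE_TIME = {"Orig": 15.4, "UMLS": 9.9, "xMG": 10.8, "Brill": 8.8, "GT": 8.6}
LINKAGE_RATIO = {"Orig": 1.0, "UMLS": 0.67, "xMG": 0.68, "Brill": 0.70, "GT": 0.66}


def time_reduction(name):
    """Relative reduction in average parsing time versus the original LGP."""
    return 1.0 - PARSE_TIME[name] / PARSE_TIME["Orig"]


def linkage_reduction(name):
    """Relative reduction in linkage numbers versus the original LGP."""
    return 1.0 - LINKAGE_RATIO[name]


for name in ("UMLS", "xMG", "Brill", "GT"):
    print(f"{name}: time -{time_reduction(name):.1%}, "
          f"linkages -{linkage_reduction(name):.1%}")
```

The largest reductions are obtained with GENIA Tagger: about 44% in parsing time and 34% in linkage numbers, in line with the bounds of 45% and 35% stated above.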

5.3 Performance 

The evaluation results are presented in Table 3. We find that in addition
to increased efficiency, all of the extensions offer an increase in
overall parsing performance compared to the original LGP for both the
first and best linkages. Remarkably, this increase occurs even with the
Brill tagger, which was trained on general English. In overall
performance, the UMLS extension and the POS extension with the Brill
tagger are roughly equal. The xMG extension outperforms both, and the POS
extension with GENIA Tagger has the best performance of all considered
extensions.
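The relative decrease in error reported in the Δ columns of Table 3 treats 100% minus recall as the error rate. A minimal sketch of the computation, with a hypothetical function name:

```python
def relative_error_reduction(baseline_recall, new_recall):
    """Relative decrease in error (in percent) with respect to a baseline.

    Both arguments are recall values in percent; error is 100 - recall.
    """
    baseline_error = 100.0 - baseline_recall
    new_error = 100.0 - new_recall
    return 100.0 * (baseline_error - new_error) / baseline_error
```

For the overall first linkage, the original LGP reaches 74.2% recall and the POS extension with GENIA Tagger 76.8%, so the error falls from 25.8% to 23.2%, a relative decrease of 2.6 / 25.8, or about 10.1%. A negative value, as for some best-linkage results, indicates an increase in error.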

The positive effect of the extensions on parsing performance is linked to
the reduced number of timeouts that occurred when parsing. Effects not
related to time limitations can be studied on sentences where no timeouts
occurred (NT). Here the effects of the extensions diverge: for the first
linkage, performance with the UMLS extension and the POS extension with
the Brill tagger essentially matches that of the unmodified LGP, while
performance with xMG and GENIA Tagger remains better. For the best
linkage, we observe a negative effect from the UMLS extension, indicating
that for some words the unknown word handling mechanism of LGP finds
correct links that are not allowed by the linking requirements given to
those words in the extended dictionary. This suggests that some errors
have occurred in the automatic mapping process^12. We similarly observe
the expected decrease in performance for the Brill tagger for the best
linkage, reflecting tagging errors.

Even for the best linkage in sentences where 
no timeouts occurred, the performance with the 
xMG extension and the POS extension with GE- 
NIA Tagger is better than that of the original LGP. 
These extensions can thus assign more appropriate 
linking requirements for some words than the un- 
known word system of LGP. This indicates high 
tagging accuracy for GENIA Tagger as well as 
an appropriate choice of linking requirements for 
both extensions, and suggests some limitation in 
the unknown word system of LGP. 

Despite significant improvements in parsing performance, the best
performance achieved by any LGP extension is 88%. This may again suggest
a limit on what performance can be achieved through the lexical
adaptation approaches.

^12 An example of one such error is in the mapping of abbreviations (e.g.
MHC) to countable nouns, leading to failures to parse in the absence of
determiners.

                     Orig   UMLS & xMG  Δ       xMG & POS  Δ      UMLS & POS  Δ       All 3  Δ
All, first linkage   74.2   75.7        5.8     76.8       10.1   76.0        7.0     76.1   7.4
All, best linkage    82.7   83.7        [...]   85.3       15.0   84.2        [...]   84.2   8.7
NT, first linkage    78.0   78.4        1.8     79.3       5.9    78.6        2.7     78.7   3.2
NT, best linkage     87.4   87.0        -3.2    88.2       6.3    87.2        -1.6    87.1   -2.4
p                    N/A    p < 0.05            p < 0.01          p < 0.01            p < 0.01

Table 5: Performance for combinations of the extensions.

5.4 Combinations of the Extensions

The UMLS, xMG and POS tagging extensions 
are to some extent complementary as their cover- 
age of the corpus vocabulary does not completely 
overlap. The dictionary extension provides the 
most frequent domain-specific lexicon while the 
xMG extension has the advantage of being able 
to handle non-canonical (e.g. mutation/deletion, 
DNA-regions) and rare words and misspellings. 
The POS extension can benefit from the context- 
sensitiveness of the tagger to disambiguate words. 

We evaluated all possible combinations of the three extensions. In these
experiments we only used GENIA Tagger for the POS extension. The results
are given in Tables 5 and 6.







Metric       Orig    UMLS & xMG   xMG & POS   UMLS & POS
Time         15.4s   9.5s         8.7s        8.3s
Lkg. ratio   1       0.67         0.59        0.62

Table 6: Ambiguity for combinations of the extensions.

On ambiguity, we observe small advantages for 
many of the combinations, but rarely more than a 
10% reduction for either metric compared to the 
simple extensions. The effect of the combinations 
on overall performance is mixed. While all com- 
binations outperform the original LGP, combina- 
tions involving the UMLS extension appear to per- 
form worse than those that do not, while combina- 
tions involving the xMG and POS extensions per- 
form better. For sentences where no timeouts oc- 
curred the effect is simple: for the best linkage, all 
combinations involving the UMLS extension per- 
form worse than the original LGP; only the com- 
bination of the xMG and POS extensions is better. 

The performance of the best combination approach essentially matches that
of the POS extension with GENIA Tagger alone, suggesting that no further
benefit can be derived from combinations when an accurate domain tagger
is available.

6 Conclusions and Future Work 

We have studied three lexical adaptation ap- 
proaches addressing biomedical domain vocabu- 
lary not found in the lexicon of the Link Gram- 
mar Parser: automatic lexicon expansion, surface 
clue based morpho-guessing, and the use of a POS 
tagger. We found that in a time-limited setting, 
any approach resolving unknown words can im- 
prove efficiency and overall performance. In more 
detailed evaluation, we found that the automatic 
dictionary extension and the use of a general En- 
glish POS tagger can reduce performance, while 
the morpho-guessing approach and the use of a 
domain-specific POS tagger had only positive ef- 
fects. We found no further benefit from combina- 
tions of the three approaches. 

Generally, our results suggest that when avail- 
able, a high-quality domain POS tagger is the 
best solution to unknown word issues in the do- 
main adaptation of a general parser, here provid- 
ing an overall 10% relative reduction in error com- 
bined with a 45% decrease in parsing time. In 
the absence of such a resource, the use of a gen- 
eral POS tagger is a poor substitute, and can lead 
to decreased performance. The use of heuristic 
methods for lexicon expansion carries the risk of 
mapping errors and should be accompanied by an 
evaluation of the effect on parsing performance. 
Conversely, surface clues can provide remarkably 
good coverage and performance when tuned to the 
domain, here using as few as 23 new rules. 

Our implementation of the adaptations to LGP combines the morpho-guessing
extension with the capability to use information from a POS tagger. Thus,
the adapted parser is faster and more accurate than the unmodified LGP in
parsing biomedical texts both when used as such and when used together
with a domain POS tagger. Further, both extensions are implemented so
that defining other morpho-guessing rules and POS-mappings is
straightforward, facilitating adaptation of the modified parser also to
other domains. The adapted LGP is available under an open-source license
at http://www.it.utu.fi/biolg.

While we found that the considered approaches 
can significantly improve efficiency and parsing 
performance, our results also indicate some limita- 
tions for lexical adaptation. As future work, com- 
plementary approaches addressing grammar adap- 
tation, text preprocessing, handling of complex 
terms, improved parse ranking and named entity 
recognition can be considered to further improve 
the applicability of LGP to the biomedical domain.